Spatio-Temporal Analysis of America temperature in 2015

Zhu Lin

Introduction

Climate change and potential global warming have become the significant issue of the day, monitoring temperatures across the region, therefore, is becoming increasingly important in response to weather-related disasters.

The data files US_weather_2015 and weather_stations correspondingly contain temperature of the United States at different stations and and their locations (marked by longitude, latitude and elevation) from January 1 to December 31, 2015, a period of 365 days.

Data description

The following packages are used in this project.

library(readr)
library(dplyr)
library(ggplot2)
library(plotly)
library(maps)
library(mapdata) 
library(maptools)
library(graphics)
library(gstat)
library(sp)
library(viridis)
library(fields)
library(gridExtra)
library(xts) 
library(spacetime) 

Data description

Read two data files US_weather_2015 and weather_stations

load("weather_data.RData")
load("weather_stations.RData")

Data description

Dataframe weather has five variables: stn, year, month, day and temp.

head(weather_data)
##      stn year month day temp
## 1 916520 2015     1   1 83.0
## 2 916700 2015     1   1 86.9
## 3 788460 2015     1   1 81.8
## 4 916600 2015     1   1 84.3
## 5 965650 2015     1   1 79.0
## 6 406370 2015     1   1 47.6

Data description

Dataframe loc has seven variables: stn, name, country, state, lat, lon and elev.

head(weather_stations)
##      stn                          name country state     lat     lon elev
## 1 423630 MISSISSIPPI CANYON OIL PLATFO      US    LA  28.160 -89.220   37
## 2 619760   SERGE-FROLOW (ILE TROMELIN)      US       -15.883  54.517   13
## 3 621010                   MOORED BUOY      US        50.600  -2.933 -999
## 4 621110                   MOORED BUOY      US        58.900  -0.200 -999
## 5 621130                   MOORED BUOY      US        58.400   0.300 -999
## 6 621140                   MOORED BUOY      US        58.700   1.100 -999

Monthly data

We can get a rough idea of the monthly temperature in the United States from the histogram and boxplot.

Monthly data

We can get a rough idea of the monthly temperature in the United States from the histogram and boxplot.

Data of Jan 1, 2015

To familiarize ourselves with the geography of the dataset, we will initially ignore the temporal component of the dataset and examine the spatial distribution of temperatures on a single day. Take day 1 as an example, therefore, we extract data of January 1, 2015.

## Extract day 1's data and sort it
head(day1)
##      stn year month day temp     lon    lat  elev num
## 1 722212 2015     1   1 59.6 -81.340 29.959   3.1   1
## 2 722250 2015     1   1 41.1 -85.000 32.333  71.0   1
## 3 725377 2015     1   1 21.8 -82.833 42.600 177.0   1
## 4 746710 2015     1   1 29.2 -87.483 36.667 174.7   1
## 5 720394 2015     1   1 36.4 -93.066 34.100  55.8   1
## 6 722251 2015     1   1 37.2 -92.588 32.514  94.8   1

Temperature distribution image of Jan 1, 2015

Since the temperature points are discrete with respect to the whole space (in the longitude-latitude-axis matrix, only 0.02% of elements are not NA), it is not available to show the United States temperatures by using the image.plot command in the fields package. In this case, we introduce the following method to plot the discrete points’ temperature distribution image.

Temperature distribution image of Jan 1, 2015

A pronounced temperature gradient is visible from highs of over 85.6 Fahrenheit in the north of the study area which is near to the equator to a low of -14.7 Fahrenheit towards the southern boundary. This is not only indicative of spatial correlation in the dataset, but it also shows that the data are not stationary, as the mean temperature must vary strongly with latitude.

According to the temperature distribution image of January 1, 2015, we can find that on January 1 in 2015, the temperature in most parts of the United States was concentrated at 30 Fahrenheit and extreme temperatures only account for a small proportion.

Mean and variance of US temperature by latitude on Jan 1 in 2015

Mean and variance of US temperature by longitude on Jan 1 in 2015

Mean and variance of US temperature by elevation on Jan 1 in 2015

Define train data and test data

Longitude, latitude, altitude and corresponding temperature were extracted from the data “day1”, 80% of which was training set and the rest was test set.

set.seed(123)
index = sample(nrow(Y_new),nrow(Y_new)*0.8)
train_X = X_new[index,]
test_X = X_new[-index,]
train_y = as.matrix(Y_new)[index]
test_y = as.matrix(Y_new)[-index]
train_Z = Z_new[index]
test_Z = Z_new[-index]

Model fitting

Model 1: Assume quadratic polynomial trend in P

library(fields)
par_est = spatialProcess(train_X,train_y,Z=train_Z,
                         mKrig.args = list(m = 3),
                         Distance = "rdist.earth",
                         cov.args = list(Covariance = "Matern",
                                         smoothness = 1))
# where m=3 means quadratic form
## Call:
## spatialProcess(x = train_X, y = train_y, Z = train_Z, mKrig.args = list(m = 3), 
##     cov.args = list(Covariance = "Matern", smoothness = 1), Distance = "rdist.earth")
##                                                          
##  Number of Observations:                          1850   
##  Degree of polynomial in fixed part:              2      
##  Total number of parameters in fixed part:        7      
##  Number of additional covariates (Z)              1      
##  MLE nugget variance ( sigma^2)                   4.565  
##  MLE process variance (rho)                       60.86  
##  MLE range parameter (theta, units of distance):  157.1  
##  Approx 95% lower bound:                          131    
##             upper bound:                          207.8  
##   Approx.  degrees of freedom for curve           609.2  
##     Standard Error of df estimate:                7.066  
##  Nonzero entries in covariance                    3422500
##  
##  Covariance Model: stationary.cov
##    Covariance function:   Matern
##    Non-default covariance arguments and their values 
##    Argument: Covariance  has the value(s): 
## [1] "Matern"
##    Argument: smoothness  has the value(s): 
## [1] 1
##    Argument: Distance  has the value(s): 
## [1] "rdist.earth"
##    Argument: theta  has the value(s): 
##    theta 
## 157.1435 
##    Argument: onlyUpper  has the value(s): 
## [1] FALSE
##    Argument: distMat  has the value(s): 
## [1] NA

Model 2: Assume linear polynomial trend in P

par_est2 = spatialProcess(train_X,train_y,Z=train_Z,
                         mKrig.args = list(m = 2),
                         Distance = "rdist.earth",
                         cov.args = list(Covariance = "Matern",
                                         smoothness = 1))
# where m=2 means linear form
## Call:
## spatialProcess(x = train_X, y = train_y, Z = train_Z, mKrig.args = list(m = 2), 
##     cov.args = list(Covariance = "Matern", smoothness = 1), Distance = "rdist.earth")
##                                                          
##  Number of Observations:                          1850   
##  Degree of polynomial in fixed part:              1      
##  Total number of parameters in fixed part:        4      
##  Number of additional covariates (Z)              1      
##  MLE nugget variance ( sigma^2)                   4.733  
##  MLE process variance (rho)                       139.8  
##  MLE range parameter (theta, units of distance):  257.7  
##  Approx 95% lower bound:                          198.8  
##             upper bound:                          384    
##   Approx.  degrees of freedom for curve           571.9  
##     Standard Error of df estimate:                6.899  
##  Nonzero entries in covariance                    3422500
##  
##  Covariance Model: stationary.cov
##    Covariance function:   Matern
##    Non-default covariance arguments and their values 
##    Argument: Covariance  has the value(s): 
## [1] "Matern"
##    Argument: smoothness  has the value(s): 
## [1] 1
##    Argument: Distance  has the value(s): 
## [1] "rdist.earth"
##    Argument: theta  has the value(s): 
##    theta 
## 257.6591 
##    Argument: onlyUpper  has the value(s): 
## [1] FALSE
##    Argument: distMat  has the value(s): 
## [1] NA

Model evaluation

As for quadratic polynomial trend

par_est$d
##              [,1]
## [1,] 566.58554323
## [2,]   6.22423087
## [3,] -10.68986435
## [4,]   0.02730697
## [5,]  -0.02599832
## [6,]   0.07907762
## [7,]  -0.00386215

The coefficient of Z is so small that it may be ignored.

par_est$eff.df
## [1] 609.1969
par_est$lnProfileLike
## [1] -4691.364
pre = predict(par_est, xnew = test_X, Z = test_Z)
RMSE = mean((pre-test_y)^2)/median(test_y)
RMSE
## [1] 0.1412453

Model evaluation

As for linear polynomial trend

par_est2$d
##               [,1]
## [1,] 105.311988795
## [2,]  -0.154836179
## [3,]  -2.111269961
## [4,]  -0.003804903

The coefficient of Z is so small that it may be ignored.

par_est2$eff.df
## [1] 571.8559
par_est2$lnProfileLike
## [1] -4702.465
pre2 = predict(par_est2, xnew = test_X, Z = test_Z)
RMSE2 = mean((pre2-test_y)^2)/median(test_y)
RMSE2
## [1] 0.1472223

Model evaluation

From above output, we can know that the log Profile likelihood for lambda of model 1 is larger than model 2, and this implies the good fitness of the former. Furthermore, the behavior of RMSE in model 1 is smaller than that in model 2. In view of goodness of fit and interpretability with relatively smaller RMSE, we choose model 1. In fact, the covariance model also can be selected, but the prediction effect of candidate models have no obvious difference, so the \(Mat\acute{e}rn\) model with flexible parameters is used to fit the model of the whole data set.

Prediction by Matern model

## parameter estimation
#(fields use great circle distances, approxiamte the earth by a ball)
par_est_whole = spatialProcess(train_X,train_y,Z=train_Z,
                               mKrig.args = list(m = 3),
                               Distance = "rdist.earth",
                               cov.args = list(Covariance = "Matern",smoothness = 1))
# where m=3 means quadratic form
print(par_est_whole)
sigma.square = as.numeric(par_est_whole$sigma.MLE)^2     # nugget variance (sigma^2)  4.225
rho = as.numeric(par_est_whole$rho.MLE)                  # process variance (rho)   60.12
theta = as.numeric(par_est_whole$theta.MLE)              # range parameter (theta)  150.646(miles) 

Construct the prediction region

## Create testing data by meshgrid  resolution of (60/50)° x (25/50)° of longitude and latitude
lon_num = 50
lat_num = 50
lon_seq = seq(-125,-65,length.out=lon_num)
lat_seq = seq(25,50,length.out=lat_num)

## Restrict the grid to the USA mainland
require(mapdata) 
tmp <- map('worldHires', 'Usa', fill=TRUE, plot=FALSE) 
require(maptools) 
US.boundary <- map2SpatialPolygons(tmp, IDs = tmp$names, proj4string = CRS(as.character(NA))) 
US.bbox.grid <- SpatialPoints(cbind(rep(lon_seq,length(lat_seq)), 
                                    rep(lat_seq,each=length(lon_seq)))) 
gridded(US.bbox.grid) <- TRUE
US.grid <- US.bbox.grid[!is.na(over(US.bbox.grid, US.boundary)),] 
Z = as.matrix(rep(mean(Z_new),length(US.grid@coords[,1])))
pre_whole = predict(par_est_whole, xnew = US.grid@coords,Z = Z) # Z is not necessary

Plot of krigged temperature

spatial pattern of the residuals after fitting

range(par_est_whole$residuals)
## [1] -13.14837  18.24184

The range of the fitted residuals is \((-13.14837,18.24184)\), and the bubble diagram reflects the overall fitted effect of the spatial model with the quadratic polynomial trend and \(Mat\acute{e}rn\) covariance, except that a few coordinates have a large deviation in the predicted value which could be a result of high elevation.